The general relationship between an arbitrary frequency distribution and the expectation value 
of the frequency distributions of its samples is established. A set of combinations of expectation 
values whose value does not in general depend on the size of the sample is constructed. Distribution 
functions such that the distribution of the expectation values of their samples is invariant in form 
are found and studied. The conditions under which the scaling limit of such distributions may exist 
are described. 

PACS numbers: 



I. INTRODUCTION 

In many interesting physical, biological and social phenomena, whenever no intrinsic scale for the relevant variables 
is present, the emergence of "scaling laws" is phenomenologically observed [1]. However, strictly speaking, a power law 
is not a proper way of fitting empirical data, since no choice of the exponent can keep the higher moments of a power 
law distribution from diverging, while every phenomenological distribution leads to finite values for all moments. 

This is not just a technicality: it is rather a reflection of the fact that the long tail of a power law distribution is 
in practice cut off by the existence of some "hidden" scale, irrelevant in the scaling region, but eventually forcing 
some upper limit on the variables describing the system. It would therefore be convenient to be able to parametrize 
the data by means of more regular distribution functions, sufficiently damped for very large values of the variables, 
but at the same time admitting power law distributions as regular limits when some control parameter implementing 
the cutoff is sent to its limiting value. 

A second, but not unrelated issue has to do with the effects of sampling, which may be non trivial even when we 
restrict our attention to the expectation values of the sampled variables. On average sampling does not affect the 
distributions of individual objects belonging to different kinds, but when we consider frequency distributions (that is 
the number of kinds that are represented k times in a given population) we cannot in general expect that the frequency 
distribution in the samples be the same as in the original population, even after averaging on many different samples, 
basically because the cutoff induced by sampling acts differently (and in general nontrivially) at different scales. 

Our purpose is therefore threefold. First we want to explore the general relationship existing between some arbitrary 
frequency distribution and the expectation value of the frequency distributions of its samples. 

Moreover we want to study classes of distributions whose samples preserve (at least approximately) the functional 
dependence on the parameters present in the original distribution, establishing the connection between the (a priori 
unknown) values of the parameters of the distribution and the (empirically measured) parameters of the sample 
distributions. 

Finally we want to study the scaling limit of these distributions (when it exists), in order to explore the possibility 
of their use for the phenomenological description of systems that are theoretically expected to show scaling in the 
limit when all empirical cutoffs are going to disappear. 

In Section II we establish the notation and the general framework of our analysis, finding a rather explicit 
mathematical relationship between the generating function of the expectation values of the sample distributions and the 
generating function of the original distribution. 

In Section III we construct an infinite set of combinations ("invariant moments") of expectation values whose value 
does not depend on the size of the sample. 

Evaluating the invariant moments on the samples therefore makes it possible to perform a formal reconstruction 
of the original distribution. 

In Section IV we briefly discuss (as a corollary of the results presented in Sec. III) the issue of correlation between 
samples, which must be checked against its theoretical value as an important test of randomness in sampling. 

In Section V we analyze a class of distributions (the so-called negative binomial distributions) admitting a scaling 
limit and enjoying the property that the distribution of expectation values of the samples has the same mathematical 
form as the original distribution. We also compute in a closed form the values of the invariant moments for these 
distributions. 

In Section VI we consider two relevant generalizations, finding a quite general class of distributions sharing the 
property of invariance in form in the case of sampling, and describing the properties of some distributions that are 
not invariant in form for arbitrary values of the parameters but nevertheless admit a scaling limit. 






Finally in Section VII we analyze the scaling limit itself and discuss the conditions under which one may expect 
this limit to be a sensible description of the original system. 



II. THE GENERAL FRAMEWORK 



We are considering a set of $N$ objects ("individuals") belonging to $S$ different kinds ("species"), and we assume 
that the set contains $N_a$ objects of the $a$-th kind, subject to the constraint 

$$\sum_{a=1}^{S} N_a = N.$$

A sample is a set of $n$ objects, containing $n_a$ objects of the $a$-th kind, subject to the constraint 

$$\sum_{a=1}^{S} n_a = n.$$

The probability $P\{n_a\}$ of extracting a specific sample $\{n_a\}$ from a given set $\{N_a\}$ is easily obtained and it takes the 
(multivariate hypergeometric) form 

$$P\{n_a\} = \binom{N}{n}^{-1} \prod_{a=1}^{S} \binom{N_a}{n_a}.$$

We can easily compute the relevant expectation values, obtaining in particular 

$$\langle n_a \rangle = N_a \frac{n}{N} = n p_a,$$

$$\langle n_a^2 \rangle = N_a (N_a - 1) \frac{n(n-1)}{N(N-1)} + N_a \frac{n}{N} = \frac{N}{N-1}\, n(n-1)\, p_a^2 + \left(1 - \frac{n-1}{N-1}\right) n p_a,$$

where $p_a = N_a/N$ is the probability of extracting an object of the $a$-th kind in a single extraction. 

It may be useful to consider also the limit of small samples, $n \ll N_a$. In this limit the probability of a specific 
sample is well approximated by the (multinomial) expression 

$$P\{n_a\} \to n! \prod_{a=1}^{S} \frac{p_a^{n_a}}{n_a!}.$$

Notice that the condition $\langle n_a \rangle = n p_a$ is preserved by this approximation, while 

$$\langle n_a^2 \rangle \to n(n-1)\, p_a^2 + n p_a.$$
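
The expectation values quoted in this Section can be checked numerically. The following minimal sketch evaluates the first two moments of $n_a$ exactly from the hypergeometric marginal distribution of a single kind and compares them with the closed forms written in terms of $p_a = N_a/N$. The helper name `hypergeom_moments` and the numerical values of $N$, $N_a$ and $n$ are illustrative assumptions, not part of the original analysis.

```python
from fractions import Fraction
from math import comb


def hypergeom_moments(N, N_a, n):
    """Exact <n_a> and <n_a^2> when n objects are drawn without replacement
    from N objects, N_a of which belong to the kind of interest."""
    total = comb(N, n)
    first = Fraction(0)
    second = Fraction(0)
    for j in range(min(N_a, n) + 1):
        # math.comb returns 0 when the lower index exceeds the upper one,
        # so impossible configurations contribute nothing.
        prob = Fraction(comb(N_a, j) * comb(N - N_a, n - j), total)
        first += j * prob
        second += j * j * prob
    return first, second


# Hypothetical example values.
N, N_a, n = 50, 12, 7
p_a = Fraction(N_a, N)

m1, m2 = hypergeom_moments(N, N_a, n)

# Closed forms in terms of p_a.
closed_m1 = n * p_a
closed_m2 = (Fraction(N, N - 1) * n * (n - 1) * p_a**2
             + (1 - Fraction(n - 1, N - 1)) * n * p_a)

# Small-sample (multinomial) approximation of the second moment.
approx_m2 = n * (n - 1) * p_a**2 + n * p_a

print(m1 == closed_m1, m2 == closed_m2, float(m2), float(approx_m2))
```

Exact rational arithmetic is used so that the comparison with the closed forms is an identity rather than a floating-point coincidence; the last two printed numbers give a feeling for the size of the finite-$N$ correction.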

A frequency distribution is a set of values $\{N_k\}$, where $N_k$ is the number of kinds such that for each of them there 
are $k$ objects in the original set. According to the definition, the following conditions must be satisfied: 

$$\sum_{k=1}^{N} N_k = S, \qquad \sum_{k=1}^{N} k N_k = N.$$

The frequency distribution of a sample is a set of values $\{n_l\}$, satisfying the conditions 

$$\sum_{l=0}^{n} n_l = S, \qquad \sum_{l=1}^{n} l\, n_l = n.$$

Notice that the frequency distribution of a sample includes the value $n_0$, corresponding to the number of kinds, 
present in the original set, which are not represented in the sample. 

It is in principle possible to compute the probability of any sample distribution $\{n_l\}$ as a function of a given set 
$\{N_k\}$. To this purpose it is convenient to define the intermediate variables $N_{kl}$, representing the (random) number 
of kinds characterized by $k$ objects in the original set and by $l$ ($l \le k$) objects in the sample. The variables $N_{kl}$ are 
strongly constrained, since they must satisfy all the conditions: 

$$\sum_{l=0}^{k} N_{kl} = N_k, \qquad \sum_{k=1}^{N} N_{kl} = n_l.$$



The probability $P\{N_{kl}\}$ of a specific configuration $\{N_{kl}\}$ follows from the general probability formula: 

$$P\{N_{kl}\} = \binom{N}{n}^{-1} \prod_{k=1}^{N} \left[ N_k! \prod_{l=0}^{k} \frac{1}{N_{kl}!} \binom{k}{l}^{N_{kl}} \right],$$

subject to the constraint $\sum_{l=0}^{k} N_{kl} = N_k$. 

The probability of finding a frequency distribution $\{n_l\}$ in a sample is then obtained by summing the probabilities 
$P\{N_{kl}\}$ over all configurations satisfying the constraint $\sum_{k=1}^{N} N_{kl} = n_l$. 

It may be convenient to define a multivariable generating function for the probability of the frequency distributions 



$$E(x; n) \equiv \sum_{\{n_l\}} P\{n_l\} \prod_{l} x_l^{n_l} = \sum_{\{N_{kl}\}} P\{N_{kl}\} \prod_{l} x_l^{\sum_k N_{kl}}.$$



By applying the explicit expression of $P\{N_{kl}\}$ and all the constraints one may obtain the relationships: 



In practice it does not seem to be possible to obtain simple closed formulas for $P\{n_l\}$. However we shall be 
interested only in the expectation values $\langle n_l \rangle$ of the frequency distribution, and these can be computed rather 
explicitly starting from the above expressions and from the relationship 

$$\langle n_l \rangle = \sum_{k=1}^{N} \langle N_{kl} \rangle = \sum_{k=1}^{N} \sum_{\{N_{kl}\}} N_{kl}\, P\{N_{kl}\}.$$

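Straightforward manipulations lead to the results 

$$\langle N_{kl} \rangle = N_k\, \frac{\binom{k}{l}\binom{N-k}{n-l}}{\binom{N}{n}}, \qquad \langle n_l \rangle = \sum_{k=1}^{N} N_k\, \frac{\binom{k}{l}\binom{N-k}{n-l}}{\binom{N}{n}}.$$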



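It is easy to check that the following relationships are satisfied: 

$$\sum_{l=0}^{n} \langle n_l \rangle = \sum_{k=1}^{N} N_k = S, \qquad \sum_{l=0}^{n} l\, \langle n_l \rangle = \left(\sum_{k=1}^{N} k N_k\right) \frac{n}{N} = n.$$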
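
The two sum rules can be verified directly. The sketch below computes the expected sample frequency distribution from a given population frequency distribution, assuming that each kind with $k$ objects contributes through the hypergeometric weight $\binom{k}{l}\binom{N-k}{n-l}/\binom{N}{n}$. The function name `expected_sample_frequencies` and the toy population are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from math import comb


def expected_sample_frequencies(Nk, n):
    """Return [<n_0>, <n_1>, ..., <n_n>] for a sample of size n.

    Nk maps k (objects per kind) to N_k (number of kinds with k objects).
    """
    N = sum(k * count for k, count in Nk.items())
    total = comb(N, n)
    expectations = []
    for l in range(n + 1):
        value = sum(
            count * Fraction(comb(k, l) * comb(N - k, n - l), total)
            for k, count in Nk.items()
        )
        expectations.append(value)
    return expectations


# Hypothetical population: 5 singletons, 3 doubletons, 2 kinds with 4 objects,
# 1 kind with 9 objects, so that N = 28 and S = 11.
Nk = {1: 5, 2: 3, 4: 2, 9: 1}
S = sum(Nk.values())
n = 10

exp_nl = expected_sample_frequencies(Nk, n)

total_kinds = sum(exp_nl)                                  # should equal S
total_objects = sum(l * v for l, v in enumerate(exp_nl))   # should equal n
print(total_kinds, total_objects, float(exp_nl[0]))
```

The first entry of the result is the expected number of kinds that are present in the population but missing from the sample.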



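In order to fully appreciate the relevance of considerations based on the expectation values we must evaluate the 
weight of the fluctuations. Taking second derivatives of the generating function $E(x; n)$ one may obtain the expression: 

$$\langle n_l^2 \rangle - \langle n_l \rangle^2 = \sum_{k,k'} N_k N_{k'} \binom{k}{l}\binom{k'}{l} \left[ \frac{\binom{N-k-k'}{n-2l}}{\binom{N}{n}} - \frac{\binom{N-k}{n-l}\binom{N-k'}{n-l}}{\binom{N}{n}^2} \right] + \sum_{k} N_k \binom{k}{l} \left[ \frac{\binom{N-k}{n-l}}{\binom{N}{n}} - \binom{k}{l} \frac{\binom{N-2k}{n-2l}}{\binom{N}{n}} \right].$$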



A very important limit of the above result may be obtained when considering the (rather typical) case $k, l \ll N, n$. 
In this limit 

$$\langle n_l \rangle = \sum_{k=1}^{N} N_k \binom{k}{l} \left(\frac{n}{N}\right)^{l} \left(1 - \frac{n}{N}\right)^{k-l} \equiv \sum_{k=1}^{N} N_k P_{kl},$$

where the definition of $P_{kl}$ follows from the above equation, and one may check that the conditions on $\sum_{l} \langle n_l \rangle$ 
and on $\sum_{l} l\, \langle n_l \rangle$ are still satisfied. 

We can also estimate the behavior of fluctuations when $k, l \ll N, n$: 

$$\langle n_l^2 \rangle - \langle n_l \rangle^2 \to \sum_{k} N_k P_{kl} - \sum_{k} N_k P_{kl}^2.$$

It is easy to check that the above expression is always smaller than $\langle n_l \rangle$ and as a consequence fluctuations 
become irrelevant for sufficiently large values of $\langle n_l \rangle$. 

In the same limit we may derive a very important relationship between the generating function of the original 
frequency distribution and the generating function of the expectation values of its samples. Let us start by defining 

$$F(t) = \sum_{k=1}^{N} N_k t^k, \qquad f(t) = \sum_{l=0}^{n} \langle n_l \rangle t^l,$$

and let's notice that the above derived expression of $\langle n_l \rangle$, when replaced in the definition of $f(t)$, after exchanging 
the order of summations and performing a summation on the index $l$ leads to 

$$f(t) = \sum_{k=1}^{N} N_k \left[1 - \frac{n}{N}(1 - t)\right]^{k} = F\!\left(1 - \frac{n}{N}(1 - t)\right).$$

Even more conspicuously, by introducing a new variable $z$ and defining 

$$G(z) \equiv F\!\left(1 - \frac{z}{N}\right), \qquad g(z) \equiv f\!\left(1 - \frac{z}{n}\right),$$

we obtain the relationship 

$$g(z) = G(z).$$

Defining $\gamma(z) = G(z) - G(0)$ and recalling that $G(N) = 0$ and $G(n) = f(0)$ we then easily obtain 

$$F(t) = \gamma[N(1-t)] - \gamma(N),$$

$$f(t) - f(0) = \gamma[n(1-t)] - \gamma(n).$$

As a consequence, whenever the (size-independent) function $\gamma(z)$ can be cast into a form exhibiting no explicit 
parametric dependence on $N$, the expectation values $\langle n_l \rangle$ can be obtained from $N_k$ simply by the replacement 
$N \to n$. 
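
The relationship between the two generating functions can be illustrated with a short numerical experiment. The sketch below assumes the binomial form of the expectation values valid for $k, l \ll N, n$, namely $\langle n_l \rangle = \sum_k N_k \binom{k}{l} q^l (1-q)^{k-l}$ with $q = n/N$, and the rescaled functions $G(z) = F(1 - z/N)$ and $g(z) = f(1 - z/n)$. All function names and numerical values are hypothetical.

```python
from fractions import Fraction
from math import comb


def F(Nk, t):
    """Generating function of the population frequency distribution."""
    return sum(count * t**k for k, count in Nk.items())


def binomial_expectations(Nk, q):
    """<n_l> for l = 0..max(k), using binomial weights with q = n/N."""
    lmax = max(Nk)
    return [
        sum(count * comb(k, l) * q**l * (1 - q)**(k - l)
            for k, count in Nk.items() if k >= l)
        for l in range(lmax + 1)
    ]


def f(nl, t):
    """Generating function of the expected sample frequency distribution."""
    return sum(value * t**l for l, value in enumerate(nl))


# Hypothetical population with N = 86 objects and a sample of n = 20.
Nk = {1: 40, 2: 12, 3: 5, 7: 1}
N = sum(k * count for k, count in Nk.items())
n = 20
q = Fraction(n, N)
nl = binomial_expectations(Nk, q)

# Check f(t) = F(1 - (n/N)(1 - t)) at an arbitrary rational point.
t = Fraction(3, 10)
lhs_f = f(nl, t)
rhs_F = F(Nk, 1 - q * (1 - t))

# Check g(z) = G(z) at an arbitrary rational point.
z = Fraction(5, 2)
g_z = f(nl, 1 - z / n)
G_z = F(Nk, 1 - z / N)

print(lhs_f == rhs_F, g_z == G_z)
```

Because the binomial weights resum exactly through the binomial theorem, both comparisons hold as identities in rational arithmetic.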

III. INVARIANT EXPECTATION VALUES 

It is very important to be able to define a set of expectation values that are independent of the size of the sample, 
and therefore may reflect very directly the properties of the original frequency distribution. 
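Let's consider the combinations 

$$\sum_{l=p}^{n} \binom{l}{p} \langle n_l \rangle = \binom{N}{n}^{-1} \sum_{k=p}^{N} N_k \sum_{l=p}^{n} \binom{l}{p}\binom{k}{l}\binom{N-k}{n-l}$$

and exploit the fact that 

$$\binom{l}{p}\binom{k}{l} = \binom{k}{p}\binom{k-p}{l-p}$$

and 

$$\sum_{l=p}^{n} \binom{k-p}{l-p}\binom{N-k}{n-l} = \binom{N-p}{n-p} = \binom{N}{n}\binom{n}{p}\binom{N}{p}^{-1}$$

to prove that 

$$\sum_{l=p}^{n} \binom{l}{p} \langle n_l \rangle = \binom{n}{p}\binom{N}{p}^{-1} \sum_{k=p}^{N} \binom{k}{p} N_k.$$

Hence the following equations hold for all $p \le n$: 

$$\langle m_p^{(n)} \rangle \equiv \frac{(n-p)!}{n!} \sum_{l=p}^{n} \binom{l}{p} \langle n_l \rangle = \frac{(N-p)!}{N!} \sum_{k=p}^{N} \binom{k}{p} N_k \equiv M_p^{(N)}.$$

The expectation values of the "moments" $m_p$ evaluated for samples of arbitrary size $n$ coincide with the "moments" 
$M_p$ of the original frequency distribution, as long as $p \le n$. 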
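
The size independence of these moments lends itself to an exact check. The sketch below assumes the normalization $M_p = \frac{(N-p)!}{N!} \sum_k \binom{k}{p} N_k$ (and the analogous expression with $n$ and $\langle n_l \rangle$ for the sample), together with hypergeometric expectation values for $\langle n_l \rangle$. The function names and the toy population are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, factorial


def expected_sample_frequencies(Nk, n):
    """Map l -> <n_l> for a sample of size n drawn without replacement."""
    N = sum(k * count for k, count in Nk.items())
    total = comb(N, n)
    return {
        l: sum(count * Fraction(comb(k, l) * comb(N - k, n - l), total)
               for k, count in Nk.items())
        for l in range(n + 1)
    }


def invariant_moment(freq, size, p):
    """(size - p)! / size! * sum_j C(j, p) * freq[j]."""
    prefactor = Fraction(factorial(size - p), factorial(size))
    return prefactor * sum(comb(j, p) * count for j, count in freq.items())


# Hypothetical population with N = 28 objects and S = 11 kinds.
Nk = {1: 5, 2: 3, 4: 2, 9: 1}
N = sum(k * count for k, count in Nk.items())
n = 10

exp_nl = expected_sample_frequencies(Nk, n)

population_moments = [invariant_moment(Nk, N, p) for p in range(n + 1)]
sample_moments = [invariant_moment(exp_nl, n, p) for p in range(n + 1)]

print(population_moments == sample_moments)
print([float(m) for m in population_moments[:4]])
```

With this normalization the zeroth moment is the number of kinds $S$ and the first moment is identically one, so the first nontrivial information sits in the second moment.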

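Notice that in the limit $k, l \ll N, n$ the definition of $m_p^{(n)}$ simplifies to 

$$m_p^{(n)} \to \frac{1}{n^p} \sum_{l=p}^{n} \binom{l}{p} n_l,$$

and the fact that $\langle m_p \rangle$ are independent of the size of the sample then follows as a trivial consequence of the 
relationship $g(z) = G(z)$, holding in the same limit and allowing the interpretation of $G(z)$ as the generating function 
of the invariant moments. It is worth analyzing the explicit expressions of the first few invariant moments: 

$$M_0 = \sum_{k=1}^{N} N_k = \sum_{l=0}^{n} \langle n_l \rangle = S,$$

$$M_1 = \frac{1}{N} \sum_{k=1}^{N} k N_k = \frac{1}{n} \sum_{l=0}^{n} l\, \langle n_l \rangle = 1,$$

$$M_2 = \frac{1}{N(N-1)} \sum_{k=1}^{N} \frac{k(k-1)}{2} N_k = \frac{1}{n(n-1)} \sum_{l=0}^{n} \frac{l(l-1)}{2} \langle n_l \rangle,$$

hence 

$$2 M_2 = \frac{1}{\alpha} = \frac{N}{N-1} \left[ \sum_{a=1}^{S} \left(\frac{N_a}{N}\right)^2 - \frac{1}{N} \right] = \frac{n}{n-1} \left[ \sum_{a=1}^{S} \left\langle \left(\frac{n_a}{n}\right)^2 \right\rangle - \frac{1}{n} \right],$$

where we introduced the notation $\alpha = 1/(2M_2)$ in order to make correspondence with the literature. 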

IV. CORRELATION BETWEEN SAMPLES 



An important test of randomness in sampling is offered by the measure of the correlation between two different 
samples. Let's consider two random samples, characterized by the sets of values $\{n_a\}$ and $\{m_a\}$ and by their sizes $n$ 
and $m$. The index $a$ labels different kinds, as in Section II. 

The correlation between the two samples is 

$$C = \frac{\sum_{a=1}^{S} n_a m_a}{\sqrt{\sum_{a=1}^{S} n_a^2\, \sum_{a=1}^{S} m_a^2}}.$$

Replacing the sums over $n_a$ and $m_a$ with their expectation values, computed in Section II, we then easily obtain (in the large $N$ 
limit) 

$$\sum_{a=1}^{S} \langle n_a \rangle \langle m_a \rangle = n m \sum_{a=1}^{S} p_a^2, \qquad \sum_{a=1}^{S} \langle n_a^2 \rangle = n(n-1) \sum_{a=1}^{S} p_a^2 + n.$$

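By making use of the results presented at the end of the previous Section we can now express the expected value 
of the correlation between samples in the form 

$$\langle C \rangle = \frac{\frac{1}{\alpha} + \frac{1}{N}}{\sqrt{\left(\frac{1}{\alpha} + \frac{1}{n}\right)\left(\frac{1}{\alpha} + \frac{1}{m}\right)}}.$$

For samples of equal size $n$ we can represent the expected value of the correlation in the form: 

$$\langle C \rangle = \frac{n}{N}\, \frac{\alpha + N}{\alpha + n}.$$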
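
To get a feeling for the numbers involved, the expected correlation can be tabulated for a few sample sizes. The sketch below takes the equal-size expression $\frac{n}{N}\frac{\alpha+N}{\alpha+n}$ from the text and, as an assumption, the form $(1/\alpha + 1/N)/\sqrt{(1/\alpha + 1/n)(1/\alpha + 1/m)}$ for samples of different sizes. The function names and the values of $\alpha$, $N$, $n$ are hypothetical.

```python
from math import sqrt


def expected_correlation(alpha, N, n, m):
    """Assumed general form of <C> for samples of sizes n and m."""
    numerator = 1 / alpha + 1 / N
    denominator = sqrt((1 / alpha + 1 / n) * (1 / alpha + 1 / m))
    return numerator / denominator


def expected_correlation_equal_sizes(alpha, N, n):
    """<C> for two samples of the same size n."""
    return (n / N) * (alpha + N) / (alpha + n)


# Hypothetical values: a large population and samples of a few thousand.
alpha, N, n = 35.0, 1.0e6, 2.0e3

general = expected_correlation(alpha, N, n, n)
equal = expected_correlation_equal_sizes(alpha, N, n)
full_population = expected_correlation(alpha, N, N, N)

for size in (1.0e2, 1.0e3, 1.0e4):
    print(size, expected_correlation_equal_sizes(alpha, N, size))
```

In this form the expected correlation approaches one when $n \gg \alpha$ and drops well below one when the samples are small compared with $\alpha$, which is what makes it usable as a test of random sampling.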



V. A CLASS OF DISTRIBUTIONS AND ITS PROPERTIES 

For our purposes it is especially interesting to consider the class of negative binomial distributions which can 
be obtained starting from the generating function 

$$F_c(t) = \frac{N (1-x)^{1-c}}{x}\, \frac{1 - (1 - x t)^c}{c}, \qquad N_k = \frac{N (1-x)^{1-c}}{x}\, \frac{\Gamma(k-c)}{\Gamma(1-c)}\, \frac{x^k}{k!},$$

where $0 < x < 1$ and the parameter $c$ is assumed to vary in the range $0 < c < 1$. 

The asymptotic behaviour of the distribution for large $k$ is easily obtained by observing that in this limit 

$$\frac{\Gamma(k-c)}{k!} \to \frac{1}{k^{1+c}}, \qquad N_k \to \frac{N (1-x)^{1-c}}{x}\, \frac{1}{\Gamma(1-c)}\, \frac{x^k}{k^{1+c}}.$$

We can now compute the generating function for the expectation values of the samples according to the general 
rule previously discussed, and obtain 

$$f_c(t) = f_c(0) + \frac{n (1-y)^{1-c}}{y}\, \frac{1 - (1 - y t)^c}{c},$$

where we have defined 

$$y = \frac{\frac{n}{N}\, x}{1 - x + \frac{n}{N}\, x}.$$

The distribution of the samples has the same form as the original distribution, once the replacements $N \to n$ and 
$x \to y$ have been performed, and therefore we obtain the asymptotic behaviour 

$$\langle n_l \rangle \to \frac{n (1-y)^{1-c}}{y}\, \frac{1}{\Gamma(1-c)}\, \frac{y^l}{l^{1+c}}.$$

It is possible to define a combination of parameters independent of the dimension of the sample: 

$$\beta \equiv N\, \frac{1-x}{x} = n\, \frac{1-y}{y},$$

and it is useful to represent $x$ and $y$ in a form showing explicitly their dependence on the dimension of the sample 
and on the invariant parameter $\beta$: 

$$x = \frac{N}{\beta + N}, \qquad y = \frac{n}{\beta + n}.$$
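
The invariance in form can be checked numerically. The sketch below assumes the negative binomial frequencies $N_k = \frac{N(1-x)^{1-c}}{x}\frac{\Gamma(k-c)}{\Gamma(1-c)}\frac{x^k}{k!}$ with $x = N/(\beta+N)$, applies binomial thinning with probability $n/N$ (the $k, l \ll N, n$ approximation), and compares the result with the same formula evaluated at $n$ and $y = n/(\beta+n)$. The function names, the truncation `kmax` and the parameter values are illustrative assumptions.

```python
from math import comb, exp, gamma, lgamma


def nb_frequency(size, beta, c, k):
    """Negative binomial frequency for a set of the given size.

    Uses x = size / (beta + size); the ratio Gamma(k - c) / k! is computed
    through lgamma to avoid overflow at large k.
    """
    x = size / (beta + size)
    gamma_ratio = exp(lgamma(k - c) - lgamma(k + 1)) / gamma(1 - c)
    return size * (1 - x) ** (1 - c) / x * gamma_ratio * x ** k


def thinned_expectation(N, n, beta, c, l, kmax=2000):
    """<n_l> obtained by binomial thinning of the population frequencies.

    The sum over k is truncated at kmax, which is harmless here because the
    terms decay geometrically.
    """
    q = n / N
    return sum(
        nb_frequency(N, beta, c, k) * comb(k, l) * q ** l * (1 - q) ** (k - l)
        for k in range(max(l, 1), kmax + 1)
    )


# Hypothetical parameters.
N, n, beta, c = 1000, 100, 50.0, 0.5

from_thinning = [thinned_expectation(N, n, beta, c, l) for l in range(1, 6)]
from_formula = [nb_frequency(n, beta, c, l) for l in range(1, 6)]

for l, (a, b) in enumerate(zip(from_thinning, from_formula), start=1):
    print(l, a, b)
```

The two columns agree up to rounding, which illustrates that only the size-dependent parameter changes under sampling while $\beta$ and $c$ stay fixed.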

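It is now possible to evaluate the invariant moments from the expression 

$$\gamma_c(z) \equiv G_c(z) - G_c(0) = \frac{\beta}{c} \left[ 1 - \left(1 + \frac{z}{\beta}\right)^{c} \right],$$

showing no explicit parametric dependence on $N$, as expected; we therefore obtain (for $p \neq 0$) 

$$M_p = \frac{\beta^{1-p}}{p!}\, \frac{\Gamma(p-c)}{\Gamma(1-c)}, \qquad \frac{1}{\alpha} = 2 M_2 = \frac{1-c}{\beta}.$$

For completeness let's observe that 

$$S = G_c(0) = F_c(1) = f_c(1) = \frac{\beta}{c} \left[ (1-x)^{-c} - 1 \right],$$

$$f_c(1) - f_c(0) = \frac{\beta}{c} \left[ (1-y)^{-c} - 1 \right],$$

and as a consequence the expected number of kinds not appearing in a given sample is 

$$\langle n_0 \rangle = f_c(0) = S - s = \frac{\beta}{c} \left[ (1-x)^{-c} - (1-y)^{-c} \right],$$

where $s \equiv f_c(1) - f_c(0)$ is the expected number of kinds represented in the sample. 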

The limit of the above results when $c \to 0$ is smooth, and it corresponds to Fisher distribution [sj , such that 

$$F_0(t) = -\beta \ln(1 - x t),$$

and 

$$f_0(t) = \beta \ln\!\left(\frac{\beta + N}{\beta + n}\right) - \beta \ln(1 - y t).$$

The generating function of the invariant moments is obtained from 

$$\gamma_0(z) = -\beta \ln\!\left(1 + \frac{z}{\beta}\right),$$

and as a consequence the invariant moments ($p \neq 0$) are exactly $M_p = \beta^{1-p}/p$. 
Notice in particular the relationship $\beta = \alpha$, peculiar to Fisher distribution. 
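
The closed form of the invariant moments in the Fisher case can be reproduced by direct summation. The sketch below assumes the log-series frequencies $N_k = \beta x^k / k$ implied by $F_0(t) = -\beta \ln(1 - x t)$, with $x = N/(\beta + N)$, and the $k \ll N$ form of the moments, $M_p = N^{-p} \sum_k \binom{k}{p} N_k$. The function names, the truncation `kmax` and the parameter values are hypothetical.

```python
from math import comb


def fisher_frequency(N, beta, k):
    """Log-series frequency N_k = beta * x**k / k with x = N / (beta + N)."""
    x = N / (beta + N)
    return beta * x ** k / k


def limit_invariant_moment(N, beta, p, kmax=3000):
    """M_p = N**(-p) * sum_k C(k, p) * N_k, truncated at kmax."""
    total = sum(
        comb(k, p) * fisher_frequency(N, beta, k)
        for k in range(max(p, 1), kmax + 1)
    )
    return total / N ** p


# Hypothetical parameters.
N, beta = 1000, 50.0

moments = [limit_invariant_moment(N, beta, p) for p in range(1, 5)]
expected = [beta ** (1 - p) / p for p in range(1, 5)]

for p, (a, b) in enumerate(zip(moments, expected), start=1):
    print(p, a, b)

# alpha = 1 / (2 * M_2) should reproduce beta in the Fisher case.
alpha = 1.0 / (2.0 * moments[1])
print(alpha)
```

Repeating the computation with a different value of `N` at fixed `beta` leaves the moments unchanged, which is the numerical counterpart of their invariance under sampling.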

VI. MORE GENERAL DISTRIBUTIONS 



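Negative binomial distributions are only a special instance of a much wider class of distributions whose generating 
function can be represented in the form 

$$F(t) = \frac{\beta}{\varphi'(1)} \left[ \varphi\!\left(\frac{1}{1-x}\right) - \varphi\!\left(\frac{1 - x t}{1-x}\right) \right],$$

where $\varphi$ is an arbitrary function of its argument (subject only to the constraints deriving from the positivity of all 
$N_k$) and $x$ is related to $N$ and $\beta$ as in the previous Section. 

We can easily evaluate the generating function of the invariant moments, finding 

$$\gamma(z) = \frac{\beta}{\varphi'(1)} \left[ \varphi(1) - \varphi\!\left(1 + \frac{z}{\beta}\right) \right],$$

and therefore we obtain 

$$f(t) - f(0) = \frac{\beta}{\varphi'(1)} \left[ \varphi\!\left(\frac{1}{1-y}\right) - \varphi\!\left(\frac{1 - y t}{1-y}\right) \right],$$

showing explicitly that the distribution of expectation values in the samples preserves the form of the original 
frequency distribution. 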

It is worth noticing that, whenever $\varphi(u) \to u^c$ in the limit $u \to 1$, the asymptotic behaviour of the distribution for 
large values of $k$ will be the same as in Section V, independent of the detailed behaviour exhibited for small $k$. It is 
also worth considering the class of generating functions admitting the representation 

$$F(t) = \frac{N}{x}\, \frac{\psi(x t)}{\psi'(x)}.$$

In the general case the sample distribution functions will not preserve strictly the form of the original distribution, 
but once again, admitting that, in the limit $t \to 1$, $\psi(1 - \xi) \to \xi^c$, one can show that the relevant asymptotic 
behaviours are 



where we have introduced the invariant coefficient 

$$A(x) = \frac{1}{\Gamma(1-c)}\, \frac{c\, (1-x)^{c-1}}{\psi'(x)},$$

and it is possible to show that, independent of the detailed form of $\psi$, the limit of $A(x)$ when $x \to 1$ must always be 
equal to $1/\Gamma(1-c)$. 



VII. THE SCALING LIMIT 



Let's now consider very large systems, and assume that we can gather information only through the sampling of n 
objects belonging to the system, with n large but not necessarily comparable to N. 

The analysis of the invariant moments may then allow us to check the applicability of a phenomenological description 
of the samples based on some distribution falling into the classes discussed in the previous Sections. In the case of a 
positive response to the check it is then possible to find numerical estimates of the parameter $\beta$ and of the exponent 
$c$ (as well as of the coefficient $A(x)$, if present). Such estimates are clearly meaningful only if $\beta$ does not turn out to 
be significantly greater than $n$. 

Under these assumptions, we can infer a description of the original system, and in the case $N \gg n$ such a 
description will correspond to computing the limit $x \to 1$ of the previous results. As a consequence, at least for 
observable (i.e. not too large) values of $k$, the original distribution is expected to be well described by the scaling 
form 

$$N_k \propto \frac{1}{\Gamma(1-c)}\, \frac{1}{k^{1+c}}.$$